Abstract: We present two sets of theoretical results on the grouped lasso 
with overlap due to Jacob, Obozinski and Vert (2009) in the linear regres- 
sion setting. This method jointly selects predictors in sparse regression, al- 
lowing for complex structured sparsity over the predictors encoded as a set 
of groups. This flexible framework suggests that arbitrarily complex struc- 
tures can be encoded with an intricate set of groups. Our results show that 
this strategy results in unexpected theoretical consequences for the proce- 
dure. In particular, we give two sets of results: (1) finite sample bounds on 
prediction and estimation, and (2) asymptotic distribution and selection. 
Both sets of results demonstrate negative consequences from choosing an 
increasingly complex set of groups for the procedure, as well as for when the 
set of groups cannot recover the true sparsity pattern. Additionally, these 
results demonstrate the differences and similarities between the grouped 
lasso procedure with and without overlapping groups. Our analysis shows 
that while the procedure enjoys advantages over the standard lasso, the set 
of groups must be chosen with caution — an overly complex set of groups 
will damage the analysis. 

Keywords and phrases: Sparsity, Variable Selection, Structured Spar- 
sity, Regularized Methods. 



1. Introduction 

In this paper, we consider the linear regression model: $y = X\beta^0 + \epsilon$, 
where $X$ is an $n \times p$ real-valued data matrix, $y \in \mathbb{R}^n$ is a 
vector of responses, $\beta^0 \in \mathbb{R}^p$ is a vector of linear weights, and 
$\epsilon$ is an error vector. Much work focuses on estimating a sparse 
$\hat\beta$, where many of the entries are equal to zero, effectively excluding 
many of the dimensions of $X$ (the candidate predictors) from the model. 
Recent work adds the notion of structure to this setting. That is, we desire the 
set of nonzero entries in $\hat\beta$ to follow some predefined structure over the 
candidate predictors. There are now many methods tailored to a diverse collection of 



*This work is part of the author's Ph.D thesis. 

†This work was funded by the National Institutes of Health grant MH057881 and National 
Science Foundation grant DMS-0943577. The author was also partially supported by National 
Institutes of Health SBIR grant 7R44GM074313-04 at Insilicos LLC. 

‡The author would like to thank Larry Wasserman and Aarti Singh for helpful comments 
and discussions, as well as the two anonymous referees whose suggestions helped greatly 
improve the contents of this paper. 




Imsart-ejs ver. 2009/08/13 file: percival-overlap-theory-revl.tex date: January 16, 2013 



Percival/Theory of Overlapping Group Lasso 






structures, including hierarchical structures, group structures, and graph derived 
structures: see Bach (2010a, 2008b, 2010b); Huang, Zhang and Metaxas (2009); 
Jenatton, Audibert and Bach (2009); Jenatton, Obozinski and Bach (2010); Peng et al. 
(2010); Percival et al. (2011); Kim and Xing (2010); Zhao et al. (2007) for ex- 
amples. 

One such structured sparse method is the grouped lasso of Yuan et al. (2006), 
which extends the familiar $\ell_1$ penalization to a grouped $\ell_1$ norm. In 
particular, the grouped lasso allows for groups of predictors to enter the model 
together, a useful property in settings such as ANOVA or multi-task regression. 
For example, we can encode a factor predictor with $m$ levels as $m - 1$ 
indicator variables in $X$. When we build a sparse regression model, we might 
prefer to select none or all of this group of $m - 1$ variables, but not any other 
subset. The grouped lasso enables this type of grouped selection. Bach (2008a); 
Chesneau and Hebiri (2008); Huang and Zhang (2010); Lounici et al. (2009); 
Nardi and Rinaldo (2008) give theoretical results for this procedure, including 
oracle inequalities and asymptotic distributions. In particular, they showed that 
for some problems the grouped lasso outperforms the ordinary lasso. 

However, the grouped $\ell_1$ norm of the grouped lasso is limited in that it only 
allows groups that partition the set of candidate predictors. This restricts the 
complexity and types of structures over the candidate predictors that can be 
encoded in the groups. For example, we could represent the potential structures 
of $\beta^0$ as a graph over $p$ nodes, where each node represents a candidate 
predictor. We might then seek to build a sparse model where the selected predictors 
correspond to a subgraph, such as a neighborhood or clique, of this graph. This 
structure can be encoded, for example, as a series of overlapping neighborhoods, 
such as 4-cycles in a 2-dimensional lattice graph. The grouped $\ell_1$ norm does 
not allow for such a set of groups. 

The grouped lasso with overlap of Jacob, Obozinski and Vert (2009) is one 
solution to this problem (see also the CAP penalty of Zhao et al. (2007), as well 
as other group based procedures in Bach (2010b); Jenatton, Audibert and Bach 
(2009)). Using an extension of the grouped norm of the grouped lasso, this proce- 
dure allows for complex, overlapping group structures. Given a collection of sub- 
sets of the set of candidate predictors, the procedure recovers nonzero patterns 
equal to a union of some subset of this collection. This property can encode many 
complex structures over the candidate predictors, and thus within the resulting 
sparsity patterns of the estimated coefficients. While Jacob, Obozinski and Vert 
(2009) gave some initial theoretical results on this procedure, including a 
consistency result, many theoretical questions were left open. In particular, how 
increasingly complex sets of groups affect the predictive and estimation 
performance of the procedure remained unanswered. The overlapping nature of the 
groups allows the possibility for an arbitrarily large set of groups to encode 
complex structures, or many possible structures simultaneously. If we suppose 
that there is no consequence to increasing the number and complexity of the 
groups, then we can freely run the procedure under many structural conditions 
simultaneously. 

The concluding remarks of Huang and Zhang (2010) indicate that the grouped 
lasso does not perform well with overlapping groups. The goal of this paper is 
to expose exactly how introducing the possibility of overlapping groups impacts 
the grouped lasso. Towards this goal, we demonstrate some theoretical proper- 
ties of the overlapping grouped lasso, with a focus on the consequence of the 
number and complexity of the groups of predictors. We give a finite sample and 
an asymptotic result. In particular, we make the following contributions: 

1. We show that both the finite sample and asymptotic performance of the 
overlapping grouped lasso suffers as the number and complexity of the 
groups grows. 

2. In the finite sample case, we show that the assumptions on the design 
matrix X become more restrictive as the complexity of the groups grows. 

3. In our asymptotic analysis, we introduce the adaptive overlapping grouped 
lasso, and give an adaptive weighting scheme with asymptotic selection 
guarantees similar to the adaptive lasso of Zou (2006) (see also the adap- 
tive grouped lasso results in Nardi and Rinaldo (2008)). 

Overall, we conclude that the overlapping grouped lasso enjoys many of the 
same theoretical guarantees as the grouped lasso, provided that the set of groups 
is not too complex or large. We therefore recommend that the procedure be used 
with a set of groups that is not overly complex and does not contain a nested 
structure. 

The paper is organized as follows: we first introduce notation for the overlap- 
ping grouped lasso. We also reproduce some basic theoretical properties of the 
procedure and the associated overlapping grouped norm. We next give our finite 
sample results, and then our asymptotic results. We then present a simulation 
study to support our theoretical results. Proofs of the main results along with 
supporting lemmas appear in the appendix. 

2. Notation 

We adopt a combination of the notation of Jacob, Obozinski and Vert (2009) 
and Lounici et al. (2009). Recall our basic setting, the linear model: 

$$y = X\beta^0 + \epsilon. \tag{1}$$

Here, $X$ is an $n \times p$ data matrix, $y \in \mathbb{R}^n$ is the response, 
$\beta^0 \in \mathbb{R}^p$ is a vector of true linear coefficients, and $\epsilon$ 
is a stochastic error term. Our goal is to estimate a sparse $\hat\beta$, such that 
the nonzero entries follow some structure which we assume to be known a priori. 
In particular, we consider structures defined in terms of groups of predictors, 
which we define as subsets of the set of candidate predictor indices: 
$I = \{1, 2, \ldots, p\}$. We denote a collection of groups as $\mathcal{G}$ with 
elements $g$ such that each $g \subseteq I$. Let $|\mathcal{G}| = M$, and assume 
$\bigcup_{g \in \mathcal{G}} g = I$. For coefficient vectors $\beta$, we define 
$\beta_g \in \mathbb{R}^{|g|}$ as the sub-vector consisting of the entries 
corresponding to the indices in $g$. Define the support of a vector as: 

$$\mathrm{supp}(\beta) = \{i : \beta_i \neq 0\} \subseteq I. \tag{2}$$
We now give a framework, proposed by Jacob, Obozinski and Vert (2009), 
to measure the structured sparsity of vectors in $\mathbb{R}^p$. We define the 
following convention: for vectors denoted $v_g \in \mathbb{R}^p$ we have that 
$\mathrm{supp}(v_g) \subseteq g$. We define a decomposition of 
$\beta \in \mathbb{R}^p$ with respect to $\mathcal{G}$ as: 

$$\mathcal{V}_{\mathcal{G}}(\beta) = \{v_g : g \in \mathcal{G}\} \text{ such that } \sum_{g \in \mathcal{G}} v_g = \beta. \tag{3}$$

That is, each decomposition in $\mathcal{V}_{\mathcal{G}}(\beta)$ is a collection 
of $M$ vectors in $\mathbb{R}^p$, each satisfying $\mathrm{supp}(v_g) \subseteq g$ 
for a different $g \in \mathcal{G}$. From now on, we suppress the $\mathcal{G}$ in 
the notation for decompositions and write $\mathcal{V}(\beta)$. 
$\mathcal{V}(\beta)$ is not unique in general. 
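We define the following norms: 

$$\|\beta\|_{2,1,\mathcal{G}} = \min_{\mathcal{V}(\beta)} \sum_{g \in \mathcal{G}} \|v_g\|, \tag{5}$$

$$\|\beta\|_{2,\infty,\mathcal{G}} = \min_{\mathcal{V}(\beta)} \max_{g \in \mathcal{G}} \|v_g\|. \tag{6}$$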

Here, $\|\cdot\|$ denotes the Euclidean or $\ell_2$ norm. The above two equations 
are norms by arguments presented in Jacob, Obozinski and Vert (2009). Note that 
the notation $\min_{\mathcal{V}(\beta)}$ indicates the minimum over all possible 
decompositions. Note that the decomposition that minimizes these norms is not 
necessarily unique, as we state in the following lemma. 

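Lemma 1. (Corollary 1 from Jacob, Obozinski and Vert (2009)) For any 
collections $\{v_g\}$, $\{v'_g\}$ minimizing the norm in Equation 5, we have, for 
all $g \in \mathcal{G}$: 

$$\|v_g\| \times \|v'_g\| = 0 \quad \text{or} \quad \frac{v_g}{\|v_g\|} = \frac{v'_g}{\|v'_g\|}. \tag{7}$$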

The above lemma implies that in some cases the collection of groups used in 
the decomposition, that is, $\{g \in \mathcal{G} \text{ s.t. } v_g \neq 0\}$, is 
not unique. 

Finally, for $J \subseteq \mathcal{G}$ we write 
$\beta_J = \sum_{g \in \mathcal{G}} v_g \mathbf{1}_{g \in J}$; note that, indexing 
the groups from 1 to $M$, $J \subseteq \{1, 2, \ldots, M\}$. 
Let $J_v(\beta) = \{g : v_g \neq 0\}$, and $M_v(\beta) = |J_v(\beta)|$. Thus, 
$J_v(\beta)$ is the set of groups used to decompose $\beta$ for a particular 
decomposition. $M_v(\beta)$ is thus a measure of the structured sparsity of 
$\beta$ with respect to a particular decomposition. Let 
$M(\beta) = \min_v M_v(\beta)$, where this minimum is taken over the set of 
decompositions minimizing the norm in Equation 5. Thus, $M(\beta)$ measures the 
overall structured sparsity of $\beta$, with respect to the groups $\mathcal{G}$. 

Here is a simple example to illustrate the setting. Let $p = 3$, and consider 
the following groups: 

$$\mathcal{G} = \{\{1, 2\}, \{2, 3\}\}. \tag{8}$$

For any $\alpha \in \mathbb{R}$, we have the following possible decomposition of 
$\beta = [a, b, c]$: 

$$\mathcal{V}(\beta) = \{v_{\{1,2\}}, v_{\{2,3\}}\}, \tag{9}$$
$$v_{\{1,2\}} = [a, \alpha b, 0], \tag{10}$$
$$v_{\{2,3\}} = [0, (1 - \alpha) b, c]. \tag{11}$$

Thus, the norm from Equation 5 can be expressed as: 

$$\|\beta\|_{2,1,\mathcal{G}} = \min_{\alpha} \left( \sqrt{a^2 + (\alpha b)^2} + \sqrt{((1 - \alpha) b)^2 + c^2} \right). \tag{12}$$

Finally, it is clear that for $a, c \neq 0$, $M(\beta) = 2$, and $M(\beta) = 1$ 
otherwise. 
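
The minimization over $\alpha$ in Equation 12 can be explored numerically. The following is a minimal sketch, not part of the original analysis: the function name `overlap_norm_example` and the use of ternary search are illustrative choices. The search relies on the objective being convex in $\alpha$, and on the fact that for this two-group example a minimizer lies in $[0, 1]$.

```python
import math


def overlap_norm_example(a, b, c, iters=200):
    """Evaluate the norm in Equation 12 for the groups {1,2} and {2,3}.

    Returns (norm value, minimizing alpha). Ternary search is used, which is
    valid here because the objective is convex in alpha.
    """
    def objective(alpha):
        # ||v_{1,2}|| + ||v_{2,3}|| for the decomposition in Equations 10-11
        return math.hypot(a, alpha * b) + math.hypot((1.0 - alpha) * b, c)

    lo, hi = 0.0, 1.0  # for this example a minimizer lies in [0, 1]
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    alpha = 0.5 * (lo + hi)
    return objective(alpha), alpha


# beta = [1, 1, 1]: by symmetry the shared coordinate is split evenly.
value, alpha = overlap_norm_example(1.0, 1.0, 1.0)
```

For this particular pair of groups the minimum also has the closed form $\sqrt{(|a| + |c|)^2 + b^2}$, which gives a quick check of the numerical search.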

3. Overlapping Grouped Lasso 

Recall our goal: under the model of Equation 1, we estimate the target $\beta^0$ 
with a sparse $\hat\beta$, that is, many entries of $\hat\beta$ are set to zero. 
Additionally, we know these nonzero entries occur in a structured pattern, as 
given by $\mathcal{G}$. We evaluate the fit with the usual quadratic loss: 

$$\mathcal{L}(\beta) = \frac{1}{n} \|y - X\beta\|^2. \tag{13}$$

The overlapping grouped lasso solves the following optimization problem: 

$$\hat\beta = \arg\min_{\beta} \left( \mathcal{L}(\beta) + 2\lambda \|\beta\|_{2,1,\mathcal{G}} \right). \tag{14}$$



Here $\lambda > 0$ is a tuning parameter controlling the amount of regularization. 
If the elements of $\mathcal{G}$ are restricted to be pairwise disjoint, then the 
norm $\|\cdot\|_{2,1,\mathcal{G}}$ reduces to the grouped $\ell_1$ norm. We then 
recover the original formulation of the grouped lasso. In the special case where 
the groups are all singletons, $\mathcal{G} = \{\{i\} : i \in I\}$, we recover the 
familiar lasso of Tibshirani (1996). If we allow $\mathcal{G}$ to be any 
collection, allowing for the possibility of overlap between groups, then the 
minimum over $\mathcal{V}(\beta)$ in the norm now plays a role since the 
decomposition of $\beta$ is no longer unique in general. This setting gives us 
the overlapping grouped lasso. For each of these problems, the key fact is that 
the support of $\hat\beta$ will be a union of members of a subset of 
$\mathcal{G}$. Finally, we also introduce the adaptive overlapping grouped lasso: 



$$\hat\beta = \arg\min_{\beta} \left( \mathcal{L}(\beta) + 2\lambda \min_{\mathcal{V}(\beta)} \sum_{g \in \mathcal{G}} \lambda_g \|v_g\| \right). \tag{15}$$

As previous work and theory have suggested (Nardi and Rinaldo (2008); Zou 
(2006)), the choice of weights $\lambda_g = 1/\|\hat\beta_g^{OLS}\|^\gamma$, where 
$\hat\beta^{OLS} = (X^T X)^{-1} X^T y$ and $\gamma > 0$, gives good asymptotic 
guarantees. In Section 5, we show that a different choice is needed in our 
setting to give similar asymptotic guarantees. 
Finally, as noted by Jacob, Obozinski and Vert (2009), the overlapping grouped 
lasso method is simple to implement. In the case where $\mathcal{G}$ consists of 
non-overlapping groups, there are several efficient algorithms available. In the 
overlapping case, no new specialized algorithm is required. Write $X_g$ as the 
sub-matrix of $X$ with only the columns of $X$ indexed by the elements of $g$. 
Now define $\tilde{X} = [X_g]_{g \in \mathcal{G}}$, an 
$n \times \sum_{g \in \mathcal{G}} |g|$ matrix formed by concatenating the columns 
of $X$ corresponding to each group in $\mathcal{G}$. We can then solve the 
optimization problem with a new, non-overlapping set of groups 
$\tilde{\mathcal{G}}$ defined on the appropriate columns of $\tilde{X}$. Since 
$\tilde{\mathcal{G}}$ is a non-overlapping set of groups for $\tilde{X}$, we can 
simply apply existing algorithms for the grouped lasso. 
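
The column duplication described above can be sketched as follows. This is an illustrative helper only: the names `expand_design` and `collapse_coefficients` are hypothetical, indices are 0-based, and the grouped lasso solver that would be run on the expanded matrix is not shown. Coefficients fitted on the expanded design are mapped back by summing the duplicated entries, which mirrors the relation $\beta = \sum_{g} v_g$ from Equation 3.

```python
def expand_design(X, groups):
    """Duplicate columns of X so that overlapping groups become disjoint.

    X is a list of rows; groups is a list of lists of 0-based column indices.
    Returns (X_tilde, new_groups, source_cols), where source_cols[k] is the
    original column that column k of X_tilde was copied from.
    """
    source_cols = [j for g in groups for j in g]
    X_tilde = [[row[j] for j in source_cols] for row in X]
    new_groups, start = [], 0
    for g in groups:
        # Each original group owns a contiguous block of duplicated columns.
        new_groups.append(list(range(start, start + len(g))))
        start += len(g)
    return X_tilde, new_groups, source_cols


def collapse_coefficients(coef_tilde, source_cols, p):
    """Sum duplicated coefficients back onto the original p predictors."""
    beta = [0.0] * p
    for k, j in enumerate(source_cols):
        beta[j] += coef_tilde[k]
    return beta


# Groups {1,2} and {2,3} from Equation 8, written with 0-based indices.
X = [[1, 2, 3], [4, 5, 6]]
X_tilde, new_groups, source_cols = expand_design(X, [[0, 1], [1, 2]])
```

Here the shared predictor appears twice in the expanded matrix, once per group, so the new groups are disjoint blocks of columns.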

4. Finite Sample Bounds 

We now give a sparsity oracle inequality for the overlapping grouped lasso. 
This finite sample result is an extension of a result on multitask regression due 
to Lounici et al. (2009), which is in turn built on results from Bickel, Ritov 
and Tsybakov (2009). We first state and discuss our main assumption, which is an 
adaptation of the restricted eigenvalue condition of Bickel, Ritov and Tsybakov 
(2009) to the overlapping grouped lasso setting. 

Assumption 1. Suppose $1 \le s \le M = |\mathcal{G}|$. Then there exists 
$\kappa(s) > 0$ such that: 



Here $J^c = \{g : g \in \mathcal{G}, g \notin J\}$, and 
$\mathcal{V}(\Delta) = \{v_g^\Delta\}$ denotes the decomposition minimizing the 
norm $\|\Delta\|_{2,1,\mathcal{G}}$. 

In the subsequent results, the integer $s$ measures the structured sparsity of 
the target. There are two key differences between this assumption and other 
restricted eigenvalue conditions. First, it relies on norms of the decompositions 
of vectors, rather than norms of the vector or appropriate sub-vectors. Note that 
the decomposition of $\Delta$ must be a decomposition minimizing the 
$\|\cdot\|_{2,1,\mathcal{G}}$ norm. As we will discuss later, this condition 
grows more restrictive as $\mathcal{G}$ becomes more complex. The second key 
difference in the assumption lies in the denominator term 
$\sum_{g \in J} \|v_g^\Delta\|$, which appears instead of the directly analogous 
$\|\sum_{g \in J} v_g^\Delta\|$. We know by the triangle inequality that 
$\|\sum_{g \in J} v_g^\Delta\| \le \sum_{g \in J} \|v_g^\Delta\|$, and so 
$\kappa$ is less than or equal to a $\kappa'$ obtained under the analogous 
assumption. In the case of non-overlapping groups, this is an equality, and the 
assumption is identical in this case. 
We now examine some sufficient conditions for the existence of $\kappa(s)$. 
Examining the numerator of the main quantity defining $\kappa(s)$, we see that 
$\sqrt{\Delta^T X^T X \Delta / n} \ge \sqrt{\rho_X} \|\Delta\|$, where $\rho_X$ is 
the minimal eigenvalue of $X^T X / n$. Examining the denominator, we can make the 
following bounds: 

$$\sum_{g \in J} \|v_g^\Delta\| \le \sum_{g \in \mathcal{G}} \|v_g^\Delta\| \tag{18}$$
$$\le \sum_{g \in \mathcal{G}} \|\Delta_g\| \tag{19}$$
$$\le \|\Delta\| (M G_{overlap})^{1/2}. \tag{20}$$

Here, $G_{overlap} = \max_{j \in I} \sum_{g \in \mathcal{G}} \mathbf{1}_{j \in g}$ 
is the maximal number of times a candidate predictor appears in the groups of the 
collection $\mathcal{G}$. Thus, as long as $X^T X$ has a nonzero minimal 
eigenvalue, we are guaranteed to find a $\kappa(s)$ of at most 
$(\rho_X / (M G_{overlap}))^{1/2}$. In particular, for $\kappa(s)$ to exist, it is 
sufficient for $X^T X$ to be positive definite. We now state our main result. 

Theorem 1. Consider the model in Equation 1. Suppose $|\mathcal{G}| = M \ge 2$, 
and $n \ge 1$. Assume that the entries of $\epsilon$ are i.i.d. Gaussian with mean 
0 and variance $\sigma^2$. Let $X$ be normalized so that the diagonal entries of 
$X^T X / n$ are all equal to 1. Denote $M(\beta^0) \le s$ as the maximum number of 
nonzero groups in decompositions of $\beta^0$, $\mathcal{V}(\beta^0)$. Let 
Assumption 1 hold with $\kappa = \kappa(s)$. Let: 



'^'^ y^^^aldlG overlap f ^ A\ogM 

/maxg \g\ 



1/2 



^ = — r 1+ r^^ ] ■ (21) 



Here, A > 8. Define q = ming(pg^)min yAy^ ming I5I/8, 8 log Afj , where pg is 

the maximal absolute eigenvalue of a Cholesky decomposition of $X_g^T X_g$, where 
$X_g$ is the sub-matrix of $X$ corresponding to the columns indexed by the group 
$g$. Then, with probability at least $1 - M^{1-q}$, for any solution $\hat\beta$ 
to Equation 14, for all $\beta^0 \in \mathbb{R}^p$, the following inequalities 
hold: 

-||X(^-/30)||2 < ^ Lax|5| -f A^I^logA/) , (22) 



32a 



1/2 



||/?-/3°||2.i,e < ^ max|5| + A /max|g|logA/ . (23) 

The proof of this result is given in Appendix A.1.2. The proof relies on 
Lemma 4 given in Appendix A.1.1. We now discuss the result. 

1. As the set of groups grows, the finite sample guarantees degrade. In 
Theorem 1, the prediction and estimation bounds both get coarser as the 
number of groups increases. Note that the set of groups can grow not only 
as the dimension of the problem grows, but also if we encode complex 
structures over the predictors using $\mathcal{G}$. Thus, even for a problem of 
fixed dimension $p$, there is a consequence to choosing an arbitrarily complex 
set of groups. To make this result clear, let the groups be maximally complex: 
$\mathcal{G} = 2^I$, the power set of the set of predictors. Now, as the 
dimension of the problem grows, the prediction bound grows at rate $O(p^{3/2})$, 
and the estimation bound at rate $O(p)$. If $|\mathcal{G}|$ is instead of the 
same order as $p$ and the maximum group size is constant, these rates are instead 
both $O(\log p)$. This shows that grouped sparsity achieves the tightest upper 
bounds if both the maximum group size and the number of groups grow at slower 
rates than $p$. Note the contrast here to the results of Lounici et al. (2009) 
in the multi-task setting, where a growing number of tasks benefited the 
procedure. Note that in the multi-task setting, the number of observations 
necessarily grows with the number of tasks, contrary to our setting. 
2. As the complexity of the groups grows, Assumption 1 becomes more 
restrictive. Since $\kappa$ appears in the denominator of both the prediction and 
estimation bounds, the bounds become less tight as $\kappa$ decreases. Consider 
the condition: 

geJ" geJ 

Recall that $J$ is a cardinality $s$ set of groups. Thus, for fixed $s$, as the 
complexity of $\mathcal{G}$ grows, the flexibility of the decompositions grows, 
and then more vectors $\Delta$ satisfy this condition. This makes $\kappa$ 
decreasing as a function of $|\mathcal{G}|$. We also recall that when $X^T X$ has 
a nonzero minimal absolute eigenvalue, we know $\kappa$ is at most 
$\sqrt{\rho_X / (M G_{overlap})}$. As noted earlier, as the complexity of the 
groups grows, $G_{overlap}$ increases as well, leading to a smaller $\kappa$ and 
in turn inferior prediction and estimation bounds. If $G_{overlap}$ is of the 
same order as the number of predictors, then $\kappa(s)$ is of order $1/M$ rather 
than $1/\sqrt{M}$. This dependence shows that our bounds depend equally on the 
dimension of the problem $M$ and the group complexity as measured by 
$G_{overlap}$. In the case of the lasso or group lasso, $G_{overlap} = 1$, giving 
us no dependence on group complexity, as expected. 
3. The results show that the procedure enjoys an advantage over non-structured 
procedures when $\beta^0$ is structured sparse. For example, in the finite sample 
case, none of our bounds depended explicitly on the dimension of the problem 
$p$. Thus, we can adopt a similar argument to those of Lounici et al. 
(2009) to show that compared to the lasso, the overlapping grouped lasso 
gives superior results in the case where $\beta^0$ is structured sparse. That is, 
from Bickel, Ritov and Tsybakov (2009), if we let: 

$$\lambda = A \sigma \sqrt{\frac{\log p}{n}}, \tag{25}$$

then for $A > 2\sqrt{2}$, we have that with probability at least 
$1 - p^{1 - A^2/8}$: 

$$\frac{1}{n} \|X(\hat\beta_{lasso} - \beta^0)\|^2 \le \frac{16 A^2 \sigma^2 s}{\kappa^2 n} \log p. \tag{26}$$



Thus, if $\sqrt{\max_g |g|} \log M + \max_g |g|$ is of smaller order than 
$\log p$, the procedure has a predictive advantage. Since $\kappa$ depends on the 
structured sparsity of the target, this result holds only for structured sparse 
targets which give sufficiently large values of $\kappa$ under our assumption. 
4. In the non-overlapping case, we can recover many results available in the 
literature. Here we have $G_{overlap} = 1$. We adjust our assumption to match 
the literature, so that the quantity in the minimum is replaced with: 



$$\frac{\sqrt{\Delta^T X^T X \Delta}}{\sqrt{n} \, \|\sum_{g \in J} v_g^\Delta\|}. \tag{27}$$



Combining this with an application of the Cauchy-Schwarz inequality 
in the last steps of the proofs of the result, we can recover the results 
of Lounici et al. (2009) in the multi-task case. In the case of the grouped 
lasso, we can recover the result from Nardi and Rinaldo (2008). The de- 
pendence on the minimal eigenvalues of the Cholesky decomposition of 
each Xg is related to the conditions given in Huang and Zhang (2010). 
In the settings of Lounici et al. (2009), Pg — ^ for all g. 
5. We can show a similar result solely in terms of $\max_g |g|$. In particular, 
for: 



^'^V^overlap / \ 

A = max \g\+A log M , (28) 

Vn \ 9 ■ J 

the same results hold, with the probability bound now evaluated at 
$q = \min_g(\rho_g^{-1}) \min\left(A/8, \; 8 \log M / \max_g |g|\right)$. 

This result is a consequence of a simple adjustment for this choice of $\lambda$ 
in the proof of Lemma 4 from the appendix. This alternate result shows that 
as the maximum group size grows, the estimation and prediction bounds 
become less tight, and the probability that they hold falls. 
6. The result does not depend on any uniqueness assumptions on the 
decomposition of $\beta^0$. The consistency result for the overlapping grouped 
lasso in Jacob, Obozinski and Vert (2009) assumes that the decomposition 
of $\beta^0$ that minimizes the $\|\cdot\|_{2,1,\mathcal{G}}$ norm is unique. Our 
result, in contrast, depends only on the maximal structured sparsity of such 
decompositions. Thus, in the case where $\beta^0$ does not have a unique 
decomposition minimizing the $\|\cdot\|_{2,1,\mathcal{G}}$ norm, our results still 
hold. This is a contrast to the asymptotic results of the next section. 
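
The overlap quantity $G_{overlap}$ from Section 4, and the value $(\rho_X / (M G_{overlap}))^{1/2}$ discussed after Assumption 1, are simple to compute for a given collection of groups. The sketch below is illustrative only: the function names are hypothetical, indices are 0-based, and the minimal eigenvalue $\rho_X$ is assumed to be supplied by the user rather than computed.

```python
import math


def group_overlap(groups, p):
    """Maximal number of groups that any single predictor belongs to."""
    counts = [0] * p
    for g in groups:
        for j in set(g):
            counts[j] += 1
    return max(counts)


def kappa_reference_value(rho_x, groups, p):
    """Evaluate (rho_X / (M * G_overlap))^(1/2) for a collection of groups.

    rho_x is assumed to be the minimal eigenvalue of X^T X / n.
    """
    M = len(groups)
    return math.sqrt(rho_x / (M * group_overlap(groups, p)))


chain = [[0, 1], [1, 2]]  # the groups of Equation 8, 0-based
power = [[0], [1], [2], [0, 1], [0, 2], [1, 2], [0, 1, 2]]  # nonempty subsets, p = 3

chain_value = kappa_reference_value(1.0, chain, 3)
power_value = kappa_reference_value(1.0, power, 3)
```

With the same $\rho_X$, the power-set collection gives a much smaller value than the two-group chain, in line with the first two remarks above.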

5. Asymptotic Results 

In this section, we consider fixed dimension asymptotics for the adaptive 
overlapping grouped lasso as described in Equation 15. These results extend those 
on the grouped lasso found in Nardi and Rinaldo (2008) to the case of overlapping 
groups. 

To begin, define the sets of indices of the true linear coefficient vector 
$\beta^0$ which are nonzero and zero as the following: 

$$H = \{i : \beta_i^0 \neq 0\}, \tag{29}$$
$$H^c = \{i : \beta_i^0 = 0\}. \tag{30}$$

Accordingly, we define $X_H$ as the sub-matrix containing the entries with 
column indices in the set $H$. Similarly, for a $p$-vector $x$, let $x_H$ be the 
sub-vector containing the entries with indices in the set $H$. Clearly, 
$H \cup H^c = I$. However, note that $H$ and $H^c$ are not necessarily the union 
of members of $J(\beta^0)$ and $J(\beta^0)^c$, respectively. We next define the 
following three subsets of $\mathcal{G}$ related to $H$ and $H^c$: 

$$\mathcal{G}_H = \{g : g \subseteq H\}, \tag{31}$$
$$\mathcal{G}_{H^c} = \{g : g \subseteq H^c\}, \tag{32}$$
$$\mathcal{G}_{H0} = \{g : |g \cap H| > 0; \; |g \cap H^c| > 0\}. \tag{33}$$

These are, respectively, the sets of groups in which the indices are all nonzero 
in $\beta^0$, all zero in $\beta^0$, and a mix of zero and nonzero in $\beta^0$. 
For this setting, we now make the following assumptions: 
For this setting, we now make the following assumptions: 

Assumption 2. As $n \to \infty$, $X^T X / n \to \mathcal{M}$, where $\mathcal{M}$ is positive definite. 

Assumption 3. The entries of the stochastic term $\epsilon$ in Equation 1 are 
i.i.d. with finite second moment $\sigma^2$. 

Assumption 4. There exists a neighborhood in $\mathbb{R}^p$ around $\beta^0$ such 
that any vector $b$ in the neighborhood has a unique decomposition $\{v_g\}$ 
minimizing the norm $\|b\|_{2,1,\mathcal{G}}$. In particular, the decomposition 
$\{v_g\}$ minimizing the norm $\|\beta^0\|_{2,1,\mathcal{G}}$ is unique. Further, 
this decomposition is such that $v_g = 0$ for all $g \in \mathcal{G}_{H0}$. 

Assumptions 2 and 3 are directly taken from the grouped lasso setting. 
Assumption 4 is another such condition adapted to our setting. A direct adaptation 
would be that there exists some $G \subseteq \mathcal{G}$ such that 
$\bigcup_{g \in G} g = \mathrm{supp}(\beta^0)$. This property is implied by 
Assumption 4. Note that these three assumptions are analogous to those needed for 
the consistency result given in Jacob, Obozinski and Vert (2009). Assumption 4 
also indirectly addresses the issue of identifiability of the groups. For 
example, for $M = 3$ and $\mathcal{G} = \{\{1, 2\}, \{2, 3\}, \{1, 3\}\}$, the 
target $\beta^0 = (a, a, a)$ does not admit a unique, norm minimizing 
decomposition within any neighborhood. Similarly, we can create the set 
$\{1, 2, 3\}$ in four possible ways from unions of members of $\mathcal{G}$. Thus, 
this particular $\mathcal{G}$ does not satisfy Assumption 4 for some targets. 
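
The count of four in the example above can be checked by enumerating every sub-collection of groups whose union equals the target set. This is a small illustrative sketch; the function name `covering_subfamilies` is hypothetical.

```python
from itertools import combinations


def covering_subfamilies(groups, target):
    """List all sub-collections of groups whose union equals the target set."""
    target = frozenset(target)
    found = []
    for r in range(1, len(groups) + 1):
        for combo in combinations(groups, r):
            if frozenset().union(*combo) == target:
                found.append(combo)
    return found


groups = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})]
ways = covering_subfamilies(groups, {1, 2, 3})
```

The four sub-collections are the three pairs of groups and the full collection of all three groups.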

In the following result, we consider the adaptive overlapping grouped lasso of 
Equation 15. We now propose a set of weights $\{\lambda_g\}$ for the adaptive 
overlapping grouped lasso. Let $\hat\beta^{OLS} = (X^T X)^{-1} X^T y$, and let 
$\{v_g^{OLS}\} = \mathcal{V}(\hat\beta^{OLS})$ be any decomposition minimizing the 
norm $\|\hat\beta^{OLS}\|_{2,1,\mathcal{G}}$. Then, let 
$\lambda_g = 1/\|v_g^{OLS}\|^\gamma$. This choice of weights gives us our main 
result: 



Theorem 2. Consider the adaptive overlapping grouped lasso. Suppose 
Assumptions 2, 3, and 4 hold. Let $\hat\beta^{OLS} = (X^T X)^{-1} X^T y$, and let 
$\{v_g^{OLS}\} = \mathcal{V}(\hat\beta^{OLS})$ be any decomposition minimizing the 
norm $\|\hat\beta^{OLS}\|_{2,1,\mathcal{G}}$. Then, let 
$\lambda_g = 1/\|v_g^{OLS}\|^\gamma$, for $\gamma > 0$ such that 
$n^{(\gamma+1)/2} \lambda \to \infty$. If $\sqrt{n} \lambda \to 0$, then, as 
$n \to \infty$: 

$$\sqrt{n} (\hat\beta - \beta^0) \to_d Z, \tag{34}$$

where the above is convergence in distribution. The vector $Z$ has entries: 

$$Z_H \sim N_{|H|}(0, \sigma^2 \mathcal{M}_H^{-1}), \tag{35}$$
$$Z_{H^c} = 0, \tag{36}$$

where $\mathcal{M}_H$ is the sub-matrix of $\mathcal{M}$ consisting of the entries 
with row and column indices in $H$. 

We now make some comments on the result. 

1. In the non-overlapping case, our result reduces to previous results from 
Nardi and Rinaldo (2008). In particular, the weights are clearly 
$\lambda_g = 1/\|\hat\beta_g^{OLS}\|^\gamma$. Given this, we could ask what is the 
consequence of simply choosing $\lambda_g = 1/\|\hat\beta_g^{OLS}\|^\gamma$ 
for the adaptive weights in any case? In the proof of the result, the impact 
is for the case when $g \in \mathcal{G}_{H0}$. In summary, the term 
$n^{\gamma/2} \|\hat\beta_g^{OLS}\|^\gamma$ is no longer $O_p(1)$, since 
$\|\beta_g^0\| > 0$. Then, we get the following distribution: 

ZHo^N\Ho\{^,a^MH\), (37) 
Zho^- = 0. (38) 

The resulting distribution is nonzero with positive probability in coordinates 
that are zero in $\beta^0$. In this situation, the problem can be remedied 
by assuming that $\mathcal{G}_{H0}$ is empty, that is: 

Assumption 5. (Separation of support) $\exists G \subseteq \mathcal{G}$ such that 
$\bigcup_{g \in G} g = H$ and $\bigcup_{g \notin G} g = H^c$. 

For many settings with overlap, this is an overly restrictive assumption. 
Note that this assumption corresponds to assuming the groups are correct 
in the non-overlapping grouped lasso. If the groups are incorrect, the 
result of this proposition gives us some insight as to what goes wrong 
asymptotically. 

2. The result gives a consequence of having an "incorrect" set of groups, 
relative to the support of $\beta^0$. When the condition $v_g = 0$ for all 
$g \in \mathcal{G}_{H0}$ of Assumption 4 is violated, we have that 
$n^{\gamma/2} \|\hat\beta_g^{OLS}\|^\gamma$ is no longer $O_p(1)$ 
for $g \in \mathcal{G}_{H0}$, and the consequence is similar to the previous 
remark. Again, we get the wrong asymptotic mean, and the estimator does not have 
good selection properties. Such a violation of Assumption 4 implies that the 
structure implied by $\mathcal{G}$ is not sufficient to capture the structure in 
$\beta^0$. 
3. These results exclude some types of structures, in particular nested groups 
in $\mathcal{G}$. The uniqueness assumption implies that we cannot 
use a $\mathcal{G}$ which contains nested groups. In this case, given a set of 
groups, the uniqueness condition of Assumption 4 is violated for some $\beta^0$. 
For example, suppose $p = 5$ and 

$$\mathcal{G} = \{\{1, 2\}, \{3, 4\}, \{1, 2, 3, 4\}, \{5\}\}. \tag{39}$$

Then, for $\beta^0 = [a, a, 0, 0, c]$, there are an infinite number of 
decompositions minimizing the $\|\cdot\|_{2,1,\mathcal{G}}$ norm. In particular, 
for any $\alpha \in (0, a)$, the following decomposition minimizes the norm: 

$$v_{\{1,2\}} = [a - \alpha, a - \alpha, 0, 0, 0], \tag{40}$$
$$v_{\{3,4\}} = [0, 0, 0, 0, 0], \tag{41}$$
$$v_{\{1,2,3,4\}} = [\alpha, \alpha, 0, 0, 0], \tag{42}$$
$$v_{\{5\}} = [0, 0, 0, 0, c]. \tag{43}$$

Then, consider the weights $\lambda_g = 1/\|v_g^{OLS}\|^\gamma$. In almost all data 
applications we have $\mathrm{supp}(\hat\beta^{OLS}) \supseteq \{1, 2, 3, 4\}$. The 
minimizing decomposition of $\|\hat\beta^{OLS}\|_{2,1,\mathcal{G}}$ will clearly 
have $v^{OLS}_{\{1,2\}} = v^{OLS}_{\{3,4\}} = 0$. This effectively excludes the 
first two groups, and we will be unable to detect all possible 
sparsity patterns. More generally, using the same argument as the example, 
we can state that in the case where groups are nested, there exist 
some $\beta^0$ which cannot be uniquely decomposed to minimize the 
$\|\cdot\|_{2,1,\mathcal{G}}$ norm. Thus, using nested groups degrades the 
asymptotic guarantees of the overlapping grouped lasso. This property precludes 
using a complex nested set of groups to encode multiple structures. 
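
The non-uniqueness in the nested example of Equations 39 to 43 can be seen numerically: the summed group norms take the same value for every $\alpha \in (0, a)$. The sketch below only checks that this cost is constant in $\alpha$ and that each decomposition sums to $\beta^0$; it does not itself prove minimality. The function names and the values $a = 2$, $c = 1$ are illustrative.

```python
import math


def nested_decomposition(a, c, alpha):
    """Decomposition of beta0 = [a, a, 0, 0, c] from Equations 40-43."""
    return {
        (1, 2): [a - alpha, a - alpha, 0.0, 0.0, 0.0],
        (3, 4): [0.0, 0.0, 0.0, 0.0, 0.0],
        (1, 2, 3, 4): [alpha, alpha, 0.0, 0.0, 0.0],
        (5,): [0.0, 0.0, 0.0, 0.0, c],
    }


def decomposition_cost(decomp):
    """Sum of Euclidean norms of the group vectors in a decomposition."""
    return sum(math.sqrt(sum(x * x for x in v)) for v in decomp.values())


def decomposition_sum(decomp):
    """Coordinate-wise sum of the group vectors."""
    return [sum(col) for col in zip(*decomp.values())]


a, c = 2.0, 1.0
alphas = (0.5, 1.0, 1.5)
costs = [decomposition_cost(nested_decomposition(a, c, t)) for t in alphas]
sums = [decomposition_sum(nested_decomposition(a, c, t)) for t in alphas]
```

Every choice of $\alpha$ gives the cost $\sqrt{2}\,a + |c|$, so no single decomposition is singled out.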



6. Simulation Study 

We now present the results of a simulation study to illuminate and support 
our earlier theoretical claims. For ease of comparison, we imitate the setting 
of Huang and Zhang (2010). Here, we explore issues most pertinent to the 
overlapping groups lasso, leaving aside some of the issues addressed by the 
simulation study in Huang and Zhang (2010). We generate an $n \times p$ design 
matrix $X$ with i.i.d. standard normal entries, with each row scaled so it has 
unit magnitude. We next generate a structured sparse $\beta^0$ vector with the 
nonzero entries defined as the union of the first $k$ groups from our set of 
groups $\mathcal{G}$. We choose the first $k$ groups to achieve a consistent 
amount of overlap in $\beta^0$ with respect to $\mathcal{G}$ between trials. We 
define $k$, $\mathcal{G}$, $n$, and $p$ separately in each experiment. After 
constructing our response from $X$ and $\beta^0$, we add zero mean Gaussian noise 
with standard deviation $\sigma = 0.01$. We compare the standard lasso against the 
overlapping groups lasso, with set of groups $\mathcal{G}$. As in Huang and Zhang 
(2010), we adopt the following metric to evaluate the performance of both 
estimators: 

$$\text{Recovery Error} : \frac{\|\beta^0 - \hat\beta\|}{\|\beta^0\|}. \tag{44}$$
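
The recovery error metric of Equation 44 is the Euclidean distance between the estimate and the target, relative to the norm of the target. A minimal sketch follows; the function name `recovery_error` is illustrative and coefficient vectors are assumed to be plain lists of floats.

```python
import math


def recovery_error(beta_hat, beta_true):
    """Relative Euclidean error ||beta_true - beta_hat|| / ||beta_true||."""
    num = math.sqrt(sum((t - h) ** 2 for h, t in zip(beta_hat, beta_true)))
    den = math.sqrt(sum(t * t for t in beta_true))
    return num / den


perfect = recovery_error([1.0, 0.0], [1.0, 0.0])  # exact recovery
null_fit = recovery_error([0.0, 0.0], [3.0, 4.0])  # all-zero estimate
```

An all-zero estimate has recovery error 1, which gives a natural reference level when reading the simulation results.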
We conduct the following pair of experiments: 

1. Study on the effect of overlap. Here, we simulate a problem that has nearly 
constant difficulty for the ordinary, un-grouped, lasso, but increasing 
difficulty for the grouped lasso. We set $p = 512$, and set each group so 
that it consists of 8 consecutive (by index) predictors. We then vary 
$G_{overlap} \in \{1, 2, \ldots, 8\}$. For example, with $G_{overlap} = 1$, the 
first two groups are $g_1 = \{1, 2, \ldots, 8\}$; $g_2 = \{9, 10, \ldots, 16\}$, 
and with $G_{overlap} = 2$, $g_1 = \{1, 2, \ldots, 8\}$; 
$g_2 = \{8, 9, \ldots, 15\}$, and so forth. We 
select k = ceiling((64 — 8)/(8 + ^Overlap)) ^ ^ groups to be nonzero in 
and set n = 192. 

2. Study on the effect of sample size. We adopt a similar setting of the first 
experiment. We set ^/Qverlap ~ ^' ^^^'^ G^P^ ^ in a similar manner as 
the first experiment. We consider $n$ satisfying $\log_2(n/48) \in \{0, 1, 2, 3, 4\}$. 

The purpose of the first experiment is to study the effect of increasing
complexity of $\mathcal{G}$ on estimation performance. For overlap parameters in $\{1, 2, 3, 4\}$, we see that as the
degree of overlap increases, the estimator's performance degrades, though not
dramatically in these settings. For an overlap parameter of 5, with groups of size 8, we can see that
due to the consecutive placement of the signal, about half of the groups may be
dropped without degradation in performance, and we return to the setting and
performance of an overlap parameter of 1. For overlap parameters in $\{6, 7, 8\}$, the estimator again does worse than
in the case of no overlap, but no worse than with an overlap parameter of 4. This result supports the
discussion surrounding Assumption 1 and Theorem 1, but still indicates that
the procedure is more robust to overlap than postulated in Huang and Zhang
(2010).
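The remark that about half of the groups can be dropped at an overlap parameter of 5 can be checked directly. The sketch below assumes that consecutive groups of size 8 are shifted by 4 indices at that setting (an assumption inferred from the worked examples in the experiment description, not a formula stated in the paper). The helper name `consecutive_groups` is illustrative.

```python
def consecutive_groups(stride, p=512, size=8):
    """Groups of `size` consecutive 1-based indices, shifted by `stride`."""
    return [tuple(range(s, s + size)) for s in range(1, p - size + 2, stride)]


overlapping = consecutive_groups(stride=4)  # assumed layout at overlap parameter 5
disjoint = consecutive_groups(stride=8)     # layout with no overlap

# Keeping every other overlapping group yields exactly the disjoint layout.
every_other = overlapping[::2]
covered = {j for g in every_other for j in g}
```

Under that assumption the retained groups tile all 512 predictors without overlap, which is consistent with performance returning to the no-overlap level.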

In the sample size study, we see that for a reasonable choice
of groups, the estimator outperforms the lasso: it is able to achieve a limiting
level of recovery error at lower sample sizes than the lasso. This supports the
conclusions of Theorem 1, as well as the conclusions from the literature about
the grouped lasso, e.g. Huang and Zhang (2010) and Lounici et al. (2009). We
thus see that even in the overlap case, the procedure still enjoys a benefit due
to group sparsity.

7. Discussion and Conclusions 

In the previous two sections, we have given results on the performance of the
overlapping grouped lasso in both the finite sample and asymptotic settings. One
of the basic steps in practical applications of this procedure is the choice of the
collection of groups $\mathcal{G}$. In both cases, we showed that an overly complex choice
of $\mathcal{G}$ degrades the theoretical guarantees on the performance of the estimator. In
the case where the dimension of the problem is fixed, increasing the number of
groups leads to looser upper bounds on both prediction and estimation in the
finite sample case. In the asymptotic setting, nested groups lead to inconsistent
selection of the true sparsity pattern. Nonetheless, when $\mathcal{G}$ is suitably chosen,
we still see that the procedure retains the theoretical benefits of the grouped
lasso demonstrated in previous literature.

Fig 1. Results of the simulation study. We compare the overlapping grouped lasso to the
ordinary lasso. Left: study on the degree of overlap of the groups. Right: study on sample
size.

In summary, we find that the overlapping grouped lasso is a useful extension 
of the grouped lasso that must be used with caution. The flexibility allowed 
by overlapping groups is valuable in many applications, and can encode a wide 
variety of structures as collections of groups. We have shown that allowing for 
overlap does not remove many of the theoretical properties and benefits proven 
for the lasso and grouped lasso. However, the procedure must be used with 
caution. While the flexible nature of the procedure suggests that the analyst 
may encode many structures simultaneously, this approach is not supported by 
the results in this paper. 



Appendix A: Proofs 

A.1. Finite Sample Result

A.1.1. Auxiliary Lemmas

Lemma 2. Let $\chi^2_D$ be a chi-squared random variable with $D$ degrees of freedom.
Then, for all $x > 0$:

$P(\chi^2_D \ge D + x) \le \exp\left(-\frac{1}{8} \min\left\{x, \frac{x^2}{D}\right\}\right)$ (45)

Proof. See Lemma A.1 from Lounici et al. (2009). □

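A quick Monte Carlo sanity check of a chi-squared tail bound of this type is sketched below. It assumes the bound has the form given in Lemma A.1 of Lounici et al. (2009), namely $P(\chi^2_D \ge D + x) \le \exp(-\min\{x, x^2/D\}/8)$. The function names, the number of draws, and the seed are illustrative choices.

```python
import numpy as np


def chi2_tail_bound(D, x):
    """Assumed upper bound on P(chi2_D >= D + x): exp(-min(x, x^2 / D) / 8)."""
    return float(np.exp(-min(x, x * x / D) / 8.0))


def empirical_tail(D, x, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(chi2_D >= D + x) from simulated draws."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(D, size=n_draws)
    return float(np.mean(draws >= D + x))
```

Comparing `empirical_tail(D, x)` with `chi2_tail_bound(D, x)` for a few values of `D` and `x` shows that the bound holds with a wide margin, so it is convenient for proofs rather than tight.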
Lemma 3. Let $\alpha, \beta \in \mathbb{R}^p$; then, $\forall \mathcal{G}$:

(46)

Proof. Let $\{v_g^{\beta}\}$ denote any decomposition of $\beta$ minimizing the norm $\|\beta\|_{2,1,\mathcal{G}}$.
Then, beginning with Hölder's inequality, the following chain gives the desired
result:



a^/3< Halloo 11/31 12 



< 



g 

1/2 

overlap 



max Qfo 2 

V 9 



EH 



^1/2 

overlap 



maxllofglls ) ||/3||2,i,e- 



(47) 
(48) 

(49) 

(50) 
□ 
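The generic fact behind a Hölder-type bound for group norms can be checked numerically: for any decomposition $\beta = \sum_g v_g$ with each $v_g$ supported on its group $g$, we have $\alpha^T \beta \le \max_g \|\alpha_g\|_2 \sum_g \|v_g\|_2$. The sketch below tests this on random data. It is a sanity check of that generic inequality only, not a reproduction of the lemma's exact constant. The group layout, dimensions, and names are hypothetical.

```python
import numpy as np


def holder_check(p=12, seed=0):
    """Return (alpha^T beta, max_g ||alpha_g||_2 * sum_g ||v_g||_2) for random data."""
    rng = np.random.default_rng(seed)
    # Hypothetical overlapping groups of 4 consecutive 0-based indices, shifted by 2.
    groups = [np.arange(s, s + 4) for s in range(0, p - 3, 2)]
    alpha = rng.normal(size=p)
    # An arbitrary decomposition: one vector per group, supported on that group.
    parts = []
    for g in groups:
        v = np.zeros(p)
        v[g] = rng.normal(size=g.size)
        parts.append(v)
    beta = np.sum(parts, axis=0)
    lhs = float(alpha @ beta)
    rhs = max(np.linalg.norm(alpha[g]) for g in groups) * sum(
        np.linalg.norm(v) for v in parts
    )
    return lhs, float(rhs)
```

Because the inequality holds for every decomposition, it also holds for a norm-minimizing one, where the sum of group norms equals the overlapping group norm of $\beta$.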



Lemma 4. Consider the model in Equation 1. Suppose $|\mathcal{G}| = M \ge 2$, and $n \ge 1$.
Assume that the entries of $\epsilon$ are i.i.d. Gaussian with mean 0 and variance $\sigma^2$.
Let $X$ be normalized so that the diagonal entries of $X^T X/n$ are all equal to
1. Let $\{v_g^{\hat{\beta}-\beta}\}$ denote a decomposition of $\hat{\beta} - \beta$ minimizing the $\|\cdot\|_{2,1,\mathcal{G}}$ norm.
Let $J = J(\beta) = \{g : v_g^{\beta} \ne 0\}$ be the set of groups that are nonzero in the norm
minimizing decomposition of $\beta$. Let:



2cr^maxg \g\ 



A log AI 
^maxg \g\ 



1/2 



(51) 



Here, A > 8. Define q = min |^yl^ming |5|/8,81ogA/j . Then, with probability 

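at least $1 - M^{1-q}$, for any solution $\hat{\beta}$ of the estimation problem, for all $\beta \in \mathbb{R}^p$, the following
inequality holds:

$\frac{1}{n}\|X(\hat{\beta} - \beta^0)\|^2 + \lambda\|\hat{\beta} - \beta\|_{2,1,\mathcal{G}} \le \frac{1}{n}\|X(\beta - \beta^0)\|^2 + 4\lambda \sum_{g \in J} \|v_g^{\hat{\beta}-\beta}\|$ (52)

Proof. We follow the proof strategy of Lounici et al. (2009). For all $\beta \in \mathbb{R}^p$, we
have:

$\frac{1}{n}\|X\hat{\beta} - y\|^2 + 2\lambda\|\hat{\beta}\|_{2,1,\mathcal{G}} \le \frac{1}{n}\|X\beta - y\|^2 + 2\lambda\|\beta\|_{2,1,\mathcal{G}}$ (53)

Let $y = X\beta^0 + \epsilon$ to obtain:

$\frac{1}{n}\|X(\hat{\beta} - \beta^0)\|^2 \le \frac{1}{n}\|X(\beta - \beta^0)\|^2 + \frac{2}{n}\epsilon^T X(\hat{\beta} - \beta) + 2\lambda\left(\|\beta\|_{2,1,\mathcal{G}} - \|\hat{\beta}\|_{2,1,\mathcal{G}}\right)$ (54)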
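The substitution step can be spelled out. Writing $y = X\beta^0 + \epsilon$, each squared residual expands as shown, and the $\|\epsilon\|^2$ terms cancel when the two sides of the optimality inequality are compared. This is a sketch of the standard algebra, with unsubscripted norms taken to be $\ell_2$ norms.

```latex
\begin{align*}
\|X\hat{\beta} - y\|^2 &= \|X(\hat{\beta} - \beta^0)\|^2
    - 2\epsilon^T X(\hat{\beta} - \beta^0) + \|\epsilon\|^2, \\
\|X\beta - y\|^2 &= \|X(\beta - \beta^0)\|^2
    - 2\epsilon^T X(\beta - \beta^0) + \|\epsilon\|^2.
\end{align*}
% Insert both expansions into
% (1/n)||X\hat\beta - y||^2 + 2\lambda||\hat\beta|| <= (1/n)||X\beta - y||^2 + 2\lambda||\beta||
% and move the cross terms to the right-hand side:
\begin{align*}
\frac{1}{n}\|X(\hat{\beta} - \beta^0)\|^2
  \le \frac{1}{n}\|X(\beta - \beta^0)\|^2
  + \frac{2}{n}\epsilon^T X(\hat{\beta} - \beta)
  + 2\lambda\left(\|\beta\|_{2,1,\mathcal{G}} - \|\hat{\beta}\|_{2,1,\mathcal{G}}\right).
\end{align*}
```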
We now examine the second term on the right hand side:

2^1/2 

-e^X0-l3) < ^^ZEl^P Lax||e^X,||) (55) 
n n \ g ) 



overlap 



max 




(56) 



Here, we apply our version of Hölder's inequality (Lemma 3). We now consider
the event:





( 




max . 







jes / / ^^verlap. 

Note that random variables V^(j) = ^Y^i=\ -^ij^ij where g{j) denotes the jth 
element of $g \in \mathcal{G}$, are standard Gaussian random variables. Within a group, they
have a multivariate normal distribution with covariance matrix XjXg/((T^n), 
where Xg denotes the sub matrix of X consisting of the columns indexed by the 
group Xg. It then follows that, provided Xg admits a Cholesky decomposition, 
that (XjXg)-i/2x^e/cr2n is a vector of i.i.d. standard normal random variables. 
Thus, letting pg denote the maximal absolute eigenvalue of (XgXg)"^/^, we 
have llXgc/cr^nll < pg\\(X.J'X.g)~^^'^'Kge/a'^n\\ by properties of the operator 
norm of (XjXg)i/2. Now, for any g e define: 



= '"V fi + rlj^V''. (58) 



Note, V.g e Q; jg < A. Now: 



2 



^ ^ / '^^overlap j ^ Isl 40-2^;^^^^^^^ j 



= P (x|g| > P^'i\9\ + A^logAf)) 

(61) 

< exp I - g ^ min { A log A/} j 

(62) 

/ ming[p-2]AlogAf . r/ . ^\ 

< exp I '^—^ mm < I mm V Isl I i ^ l^S 

(63) 
In the above, we used Lemma 2 for the probability bound on $\chi^2$ variables. We
now apply the union bound to obtain:

Y^A") < M exp ^-^^ — min <^ min ] ,A\ogM}] (64) 



^ 9 

< M^-'> (65) 
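Now, on the event $\mathcal{A}$, we can obtain, from Equation 54:

$\frac{1}{n}\|X(\hat{\beta} - \beta^0)\|^2 + \lambda\|\hat{\beta} - \beta\|_{2,1,\mathcal{G}}$ (66)

$\le \frac{1}{n}\|X(\beta - \beta^0)\|^2 + 2\lambda\left(\|\hat{\beta} - \beta\|_{2,1,\mathcal{G}} + \|\beta\|_{2,1,\mathcal{G}} - \|\hat{\beta}\|_{2,1,\mathcal{G}}\right)$ (67)

$\le \frac{1}{n}\|X(\beta - \beta^0)\|^2 + 4\lambda \sum_{g \in J} \|v_g^{\hat{\beta}-\beta}\|$ (68)

where $\{v_g^{\hat{\beta}-\beta}\}$ denotes a decomposition of $\hat{\beta} - \beta$ minimizing the $\|\cdot\|_{2,1,\mathcal{G}}$ norm.
Note that the last line follows from the fact that $\|\cdot\|_{2,1,\mathcal{G}}$ obeys the triangle
inequality. This gives us the desired result in Equation 52. □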

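A.1.2. Proof of Theorem 1

Again, we follow the proof strategy of Lounici et al. (2009). Fix a decomposition
$\{v_g^{\beta^0}\}$ of $\beta^0$. Let $J = J(\beta^0) = \{g : v_g^{\beta^0} \ne 0\}$. Let the event $\mathcal{A}$ in Lemma 4 hold
and let $\beta = \beta^0$ in the inequality 52:

$\lambda\|\hat{\beta} - \beta^0\|_{2,1,\mathcal{G}} \le 4\lambda \sum_{g \in J} \|v_g^{\hat{\beta}-\beta^0}\|$ (69)

$\implies \sum_{g \in J^c} \|v_g^{\hat{\beta}-\beta^0}\| \le 3 \sum_{g \in J} \|v_g^{\hat{\beta}-\beta^0}\|$ (70)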
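The passage between the two displayed inequalities can be made explicit. The sketch assumes that $\{v_g\}$ is the norm-minimizing decomposition of $\hat{\beta} - \beta^0$, so that the overlapping group norm equals $\sum_g \|v_g\|$, and that $\lambda > 0$. The shorthand $v_g$ is used here only for brevity.

```latex
\begin{align*}
\lambda \sum_{g \in J} \|v_g\| + \lambda \sum_{g \in J^c} \|v_g\|
  = \lambda \|\hat{\beta} - \beta^0\|_{2,1,\mathcal{G}}
  &\le 4\lambda \sum_{g \in J} \|v_g\| \\
\implies \quad \sum_{g \in J^c} \|v_g\| &\le 3 \sum_{g \in J} \|v_g\|.
\end{align*}
```

This cone-type condition is what allows a restricted eigenvalue assumption to be applied in the next step.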

Thus, we can apply Assumption 1 with $\Delta = \hat{\beta} - \beta^0$, $V(\Delta) = \{v_g^{\hat{\beta}-\beta^0}\}$ to obtain:

v||.^~-^" <\m^m (71) 

Again, when the event $\mathcal{A}$ in Lemma 4 holds and for $\beta = \beta^0$ in the inequality 52:

$\frac{1}{n}\|X(\hat{\beta} - \beta^0)\|^2 \le 4\lambda \sum_{g \in J} \|v_g^{\hat{\beta}-\beta^0}\|$ (72)

4A - 

< \\X{13-13^)\\ (73) 
K^Jn 

=^ J^||X(/3-/3")|p<^ (74) 
=^ i||X(^-/?")||2< ^ LaxIgl+^logM^ (75) 
This corresponds to the result in Equation 22. Equation 23 follows from an
analogous chain as the above, beginning with the inequality 69.

A.2. Asymptotic Setting

Before we prove the main result, we give the following lemma.

Lemma 5. Let Assumption 4 hold. For $g \in G_{H_0}$, $n^{\gamma/2}(\|v_g^{OLS}\|)^{\gamma}$ is $O_p(1)$ for
$\gamma > 0$.

Proof. By Assumption 4, we may denote $\{v_g\} = V(\beta^0)$ as the unique decomposition
minimizing the norm $\|\beta^0\|_{2,1,\mathcal{G}}$. To make the dependence on $n$ explicit, we
denote $\hat{\beta}^{OLS}_n$ as the ordinary least squares estimate for $\beta^0$ using $n$ data points.
We know $\hat{\beta}^{OLS}_n \to \beta^0$ in probability, as $n \to \infty$. By Assumption 4, there exists
an $N$ such that, with high probability, $\hat{\beta}^{OLS}_n$ has a unique decomposition
for all $n > N$. We denote this unique decomposition as $\{v_g^{OLS,n}\} = V(\hat{\beta}^{OLS}_n)$,
minimizing $\|\hat{\beta}^{OLS}_n\|_{2,1,\mathcal{G}}$.

We next write $\hat{\beta}^{OLS}_n = \beta^0 + \delta_n$, and then define the decomposition
$w_g^{\delta_n} = v_g - v_g^{OLS,n}$. Recall that for $g \in G_{H_0}$, we have $\|v_g\| = 0$ and furthermore
$\|\hat{\beta}^{OLS}_n\|_{2,1,\mathcal{G}} \to \|\beta^0\|_{2,1,\mathcal{G}}$ in probability. Thus, considering the terms in $\|\hat{\beta}^{OLS}_n\|_{2,1,\mathcal{G}}$
corresponding to those $g \in G_{H_0}$, we conclude $\|v_g^{OLS,n}\| \to 0$ in probability as $n \to \infty$
for $g \in G_{H_0}$.

Finally, for $g \in G_{H_0}$: $\sqrt{n}\,\|v_g^{OLS,n}\| = \sqrt{n}(\|v_g^{OLS,n}\| - 0) \in O_p(1)$.
The result then follows for $\gamma > 0$ by the continuous mapping theorem. □



A.2.1. Proof of Theorem 2

We follow the general proof strategy of Theorem 3.2 from Nardi and Rinaldo
(2008), which is adapted from similar results on the lasso from Fu and Knight
(2000) and Zou (2006). First, define $\beta_n = \beta^0 + u/\sqrt{n}$. Let $\{v_g^0\} = V(\beta^0)$ and
$\{v_g^n\} = V(\beta_n)$ be decompositions of $\beta^0$ and $\beta_n$ minimizing $\|\beta^0\|_{2,1,\mathcal{G}}$ and $\|\beta_n\|_{2,1,\mathcal{G}}$,
respectively. Therefore, the following is a decomposition of $u$: $\forall g \in \mathcal{G}$, $v_g^u = \sqrt{n}(v_g^n - v_g^0)$.

To begin, we write the objective from Equation 15 (multiplied by ^) as: 



Qn{u) 



-.Xu + e 
Let:



Dn{u) ^ Qn{u) - Qn{0) 



2n \/n 



9 



,0 



1 



Kll 



We now proceed to examine the terms in the second summation. The behavior 
of these terms depends on the group g: 

• For g e Gh, we have A„ — > l/||w°||2 in probabihty, by the uniqueness of 
the decomposition {v'^} along with Assumption 4. Also: 



'^9r-°9 



Since ^/nX = o(l), then the term l2.n,g ^ 
For g e Ghc, n'y/^\\vf^^\\'-' = 0^(1) and: 

1 



Since, n'^'^'^^^^^X oo, then l2.n,g oo. 

For g e Gho, and n''^^\\vf^^\\2 is Op{l) by Lemma 5. As before, 



So l2,n,g oo. 

Now, $I_{1,n} \to \frac{1}{2} u^T M u - u^T W$, where $W \sim N_p(0, \sigma^2 M)$. Since $p$ is fixed and
finite, it follows that $D_n(u) \to D(u)$, where:



1„,T 

D{u) - 2 



u^Mu - u^W if V.g iGu-K 



else 



Now, $\hat{u} = (M_H^{-1} W_H, 0)^T$ minimizes $D(u)$ and so by the argmax theorem (van der Vaart and Wellner,
1998, Corollary 3.2.3), the result follows.